1 Visualising (null) hypotheses

It’s important to think about what you expect data to look like before you start collecting it (or analysing it, if you’re using a secondary dataset). Let’s start with sketching.

1.1 Sketching

Think about the grid task: there is a 4x4 grid, and on each trial 4 cells light up. The participant is rewarded for selecting any four cells that aren’t the ones that lit up. Here’s what we’re measuring, in vague terms:

  • Our independent variables are:
    • “generations” - we want to know if our dependent variables change as a result of cultural transmission
    • Something like “learner type” - we want to think about children vs primates, adults vs children, etc (you can consider this task in light of whatever species, developmental age, etc you like)
  • Our dependent variables are:
    • Performance: can they innovate?
    • Tetrominoes: are they selecting more Tetris pieces than we would expect by chance?
    • Predictability: are they innovating in systematic ways, or just selecting random cells?

Using the Miro board at your table, sketch with your group what you think the results will look like under:

  • conditions where this innovation task does not lead to CCE, versus
  • conditions where the innovation task does lead to CCE

In either case, you might have ideas about how different learner types perform differently; discuss this and try to integrate it into your sketches. Once you have some final sketches, take some screenshots of them. You’ll use these to compare with some real data as part of the next exercise.

1.2 Real data

That first exercise was designed to get you thinking about what you expect data to look like. Sketching is useful at this stage because you might not have access to the data yet, but it also sets up your hypotheses in a concrete visual way. This starts to form part of a pipeline for your work: you already know how you’ll need to visualise your data to check your hypotheses before you even have it.

Based on your sketches you may have come up with a variety of ways you’d like to look at the data. In this exercise, we’ll look at actual data from this task from Saldana et al (2019), which tested this grid/innovation task in both children and baboons. While there are many ways you could map your variables visually, we’ll choose one to move forward with. A bit later, we’ll discuss in more detail how to make decisions about mapping variables to visual elements.

  • “Generation” is on the x axis (more on the why of this later)
  • We’ll use different colours to represent different learner types (baboons vs children)
  • You generally want to compare each of your dependent variables to your independent variables separately - this means we want 3 different graphs, one for performance, one for predictability, and one for tetrominoes.

1.2.1 Understanding the data

Let’s start by looking at the raw data that was provided on OSF from Saldana et al (2019). I highly encourage you to look at the whole paper in detail eventually (if you haven’t already), but for now, this exercise might be more useful if you deal with the data a bit more naively.

The data come in two separate data files, dataBaboons.csv and dataChildren.csv, which are both in the data folder. As noted in the brief lecture, these are both in long format: this means that every line is a single response. The first step is to read the data as variables:

children<-read_csv("data/dataChildren.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   GridSeen = col_double(),
##   GridDone = col_double(),
##   Generation = col_double(),
##   TrialNb = col_double(),
##   ChainNb = col_double(),
##   BinTetroDone = col_double(),
##   BinTetroSeen = col_double(),
##   Score = col_double(),
##   SymmetryBin = col_double(),
##   TrialName = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
#uncomment below and run the chunk to look at the head of the file; this will display the first 6 rows and give you a sense of what the columns are
head(children)
## # A tibble: 6 x 20
##   GridSeen GridDone Generation PartID PrevPartID TrialNb DateTime        ChainNb
##      <dbl>    <dbl>      <dbl> <chr>  <chr>        <dbl> <chr>             <dbl>
## 1     1158     1678          1 A_01   A_seed           1 Wed Jul 11 15:…       1
## 2     1274     1624          1 A_01   A_seed           2 Wed Jul 11 15:…       1
## 3     1467      841          1 A_01   A_seed           3 Wed Jul 11 15:…       1
## 4      207      479          1 A_01   A_seed           4 Wed Jul 11 15:…       1
## 5     1633        9          1 A_01   A_seed           5 Wed Jul 11 15:…       1
## 6      615     1811          1 A_01   A_seed           6 Wed Jul 11 15:…       1
## # … with 12 more variables: TetrominoDone <chr>, BinTetroDone <dbl>,
## #   TopBottomDone <chr>, RightLeftDone <chr>, TetrominoSeen <chr>,
## #   BinTetroSeen <dbl>, TopBottomSeen <chr>, RightLeftSeen <chr>, Score <dbl>,
## #   Symmetry <chr>, SymmetryBin <dbl>, TrialName <dbl>


After looking at the column headers, think about which columns we can dispense with for this particular visualisation. We’ll be making a copy of the dataset with particular columns, so you’re not getting rid of any data. You may not always want to do this, but it can make things a bit easier, and in this case, it’s necessary so that we can put the child and baboon data together to visualise them side by side (we need the datasets to have identical column headers in order to do this). For starters, we’ll deal only with generation, score, and proportion of tetrominoes.

What columns do we want to keep from the data file?

  • Generation
  • Score
  • BinTetroDone

We also want predictability and learner type, but those aren’t there yet - how we get these will become clear later. Before we get a leaner version of the child data, let’s read in the baboon data.

baboons<-read_csv("data/dataBaboons.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   TestingPhase = col_character(),
##   DateTime = col_datetime(format = ""),
##   TetrominoDone = col_character(),
##   TopBottomDone = col_character(),
##   RightLeftDone = col_character(),
##   TetrominoSeen = col_character(),
##   TopBottomSeen = col_character(),
##   RightLeftSeen = col_character(),
##   Symmetry = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
#uncomment below and run the chunk to look at the head of the file; this will display the first 6 rows and give you a sense of what the columns are
head(baboons)
## # A tibble: 6 x 22
##   GridSeen GridDone Name    Sex     Age TrialName Score TestingPhase TrialNumber
##      <dbl>    <dbl> <chr>   <chr> <dbl>     <dbl> <dbl> <chr>              <dbl>
## 1      883       26 EWINE   fema…    91         1     1 Test                   1
## 2     1301     1230 EWINE   fema…    91         1     0 Test                   1
## 3     1043      602 VIOLET… fema…   146         1     1 Test                   1
## 4     1280     1819 DAN     male    107         1     0 Test                   1
## 5      711      211 HARLEM  male     55         1     0 Test                   1
## 6     1622     1678 FANA    fema…    84         1     0 Test                   1
## # … with 13 more variables: Generation <dbl>, DateTime <dttm>, ChainNb <dbl>,
## #   TetrominoDone <chr>, BinTetroDone <dbl>, TopBottomDone <chr>,
## #   RightLeftDone <chr>, TetrominoSeen <chr>, BinTetroSeen <dbl>,
## #   TopBottomSeen <chr>, RightLeftSeen <chr>, Symmetry <chr>, SymmetryBin <dbl>

Here, there’s an additional column we need to pay attention to: TestingPhase. The baboons had a control condition in which there was no transmission: each baboon was given an entirely new random set of cells (rather than the one produced by the previous baboon), but with the same remit to innovate around the prompt rather than reproduce it. This is meant to test whether transmission plays an important role in a non-copying task. For now, we only care about the transmission condition - we’ll need to use the values in this column to eliminate the non-transmission condition.

1.2.2 Shaping the data

First, we’re going to deal with performance and tetromino proportions; these are relatively simple to summarise because we can glom all trials from a given generation and learner type together. Calculating entropy requires a set of responses, so we have to first calculate a value for each participant, and then summarise across chains in a given generation (if you’re not familiar with entropy, read a bit more about it here - it’s a very useful descriptive measure which can be applied to many different kinds of data). Let’s deal with the simple case first.

Let’s start with the filter() function to deal with the TestingPhase issue in the baboon dataset. We are interested in the transmission condition; to isolate this, we need to understand exactly how it’s coded in the data. Run the chunk below to find out.

unique(baboons$TestingPhase)
## [1] "Test"   "Random"

This shows us that the TestingPhase column contains the values “Test” and “Random”. Notice how we can use $ to select a particular column from a dataframe. Based on the values coded in the data, replace VALUE HERE with the correct value below to filter out the trials we aren’t interested in.

baboonsTest<-baboons %>%
  filter(TestingPhase=="Test") 
#replace 'VALUE HERE' with the value you want to keep
#If you don't replace this, what do you think will happen?
#also note that you can use != with the value you don't want, which is standard notation for "does not equal" 

There are a couple of things to note about this code chunk. The first is that we’re making a new dataframe, baboonsTest, which starts with a copy of baboons that we’re modifying in a particular way. We’ll never use the full baboons dataset in this exercise, but in general (assuming you don’t have storage or memory limitations with a very large dataset), making copies allows you to backtrack and preserve things where necessary.

The next thing you’ll notice is the %>%, known in the tidyverse as a “pipe”. Pipes allow us to chain different commands together to do lots of things at once in a transparent way. Once we’ve gone through our summary step by step, we’ll look at how to do it all in one go using these kinds of pipes.

Finally, note that while variable names in the tidyverse can generally be written without double quotes, variable values need to be within quotes if they are strings. So while TestingPhase doesn’t require double quotes, the value we’re selecting does need to be in double quotes. Also note that R can only find TestingPhase because it already knows (from the line prior) that we’re dealing with the baboons data frame.
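To make the ==/!= point concrete, here’s a minimal sketch using a made-up two-row-per-phase tibble (not the real data): when a column only has two values, keeping the one you want and dropping the one you don’t are equivalent.

```r
library(dplyr)

# Toy stand-in for the baboon data (made-up values)
toy <- tibble(TestingPhase = c("Test", "Random", "Test"),
              Score        = c(1, 0, 1))

kept_eq  <- toy %>% filter(TestingPhase == "Test")    # keep the test trials
kept_neq <- toy %>% filter(TestingPhase != "Random")  # same rows, phrased as "not Random"

identical(kept_eq, kept_neq)  # TRUE
```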

Now we have a version of baboons, baboonsTest, which isolates the test trials. Next, let’s isolate the variables we’re interested in - we’ll start by only looking at performance and tetromino proportions over generational “time”. Enter the variable (column) names we want to select as arguments to the select() function below. Note that we need to overwrite baboonsTest with a new version to integrate the select command.

baboonsTest<-baboonsTest %>%
  select(Generation,Score,BinTetroDone)

Now, do the same for the children, only we don’t need anything but the select() command because there is only one condition in this data frame. However, we do need to copy the original because we haven’t done that yet.

#copy the children data frame into a new data frame, childVis, and use a pipe to select only the columns we really need
childVis<-children %>%
  select(Generation,Score, BinTetroDone)

Now we have each of the datasets separately, but what we really want is to have the child data and baboon data together, so we can compare them visually. For this, we want to use rbind() which stands for “row bind” to basically stick these two data frames together. rbind() requires dataframes with identical column names - uncomment and run the chunk below and note how the rbind call throws an error.

#rbind(children, baboons)

You get the idea. But since we’ve just gone to great lengths to select identical columns from our dataframes, we can just get on with it, right? Well, not quite. Think about why that might not be a good idea for a second before continuing on…

If we throw the frames together now, something will be missing: which rows are children and which rows are baboons? This information wasn’t in the original dataframes, quite reasonably, because it wouldn’t have varied between any of the rows in any way. But now that we’re glomming them together, we need to add it. Below, we add it for the baboon dataset; do the same for the child one, and then bind them together into allData.

baboonsTest<-baboonsTest %>% 
  add_column(LearnerType="Baboon")
baboonsTest
## # A tibble: 4,500 x 4
##    Generation Score BinTetroDone LearnerType
##         <dbl> <dbl>        <dbl> <chr>      
##  1          1     1            1 Baboon     
##  2          1     0            0 Baboon     
##  3          1     1            1 Baboon     
##  4          1     0            1 Baboon     
##  5          1     0            0 Baboon     
##  6          1     0            1 Baboon     
##  7          1     0            1 Baboon     
##  8          1     1            0 Baboon     
##  9          1     1            1 Baboon     
## 10          1     0            0 Baboon     
## # … with 4,490 more rows
#Now do the same for the childVis dataframe
childVis<-childVis %>%
  add_column(LearnerType="Child")
childVis
## # A tibble: 1,800 x 4
##    Generation Score BinTetroDone LearnerType
##         <dbl> <dbl>        <dbl> <chr>      
##  1          1     1            1 Child      
##  2          1     1            1 Child      
##  3          1     1            1 Child      
##  4          1     1            1 Child      
##  5          1     1            0 Child      
##  6          1     1            1 Child      
##  7          1     0            1 Child      
##  8          1     0            1 Child      
##  9          1     1            1 Child      
## 10          1     1            1 Child      
## # … with 1,790 more rows
#Complete the rbind call to add them together
allData<-rbind(baboonsTest,childVis)

Below, check out how all of this could be done using pipes, which allow us to chain commands together in a much cleaner bit of code.

childVis<-children %>%
  select(Generation, Score, BinTetroDone) %>%
  add_column(LearnerType="Child")

baboonsTest<-baboons %>%
  filter(TestingPhase=="Test") %>%
  select(Generation, Score, BinTetroDone)  %>%
  add_column(LearnerType="Baboon")

alldat<-rbind(baboonsTest,childVis)

1.2.3 Summarising & Visualising the data

Now that we have all the data together, we’re ready to summarise and look at our variables. Note that depending on the kind of data you have, it might not make sense to compute summary statistics at this point. For example, if you have truly continuous variables, e.g., amount of food eaten (weight in g) and temperature, you might want to start with a scatterplot of the raw data rather than calculating means. However, in this case, we have binary outcome variables (success at innovation or not, produced a tetromino or not). Plotting these without summarising them would be visually useless. Take a quick look at the plot below (by executing the code chunk) to see why.

pointscore<-ggplot(data=alldat,aes(x=Generation, y=Score))+
  geom_point()

pointscore

All of the values are at 0 or 1, and ggplot is simply drawing them on top of each other, so we really can’t see much of anything. Note that this is because when we code categorical data, we’ve done something kind of sneaky and assigned it numbers (0 or 1) - if you looked carefully at the information in the head() call for the original data frame (or the column specifications spit out by read_csv()), you’ll notice that Score is a double. This means that R has assumed this is a float/decimal type number. This is actually deliberate on the part of the authors, because this categorical variable needs to be used to calculate proportions, but it’s caused ggplot to make an incorrect assumption: that values between 0 and 1 are possible. We’re going to calculate proportions that will give us these kinds of values in the end, but remember that even though we might visualise this as a continuous variable, it isn’t one for analysis purposes (i.e., you would need to use logistic rather than linear regression).

Before we make this more useful to look at, let’s use the code for this relatively useless graph to start to understand how ggplot works.

pointscore<-ggplot(data=alldat,aes(x=Generation, y=Score))+
  geom_point()

pointscore

First, we’re putting our plot into a variable called pointscore. You don’t have to do this - you could just call ggplot(...)+ and it would spit out a plot. You might sometimes prefer this if you’re playing around with looking at some data. But generally, assigning your plot to a variable is cleaner and allows us to play with some aesthetics later without actually changing the original plot, e.g., I can test what happens if I add lines, while still preserving the original graph.

pointscore+geom_line()

pointscore

Adding lines at this point doesn’t make this any less visually meaningless - but let’s push on with understanding some basics of ggplot before we tidy it up.

Inside the initial call to ggplot() we need two things: a data frame that we want ggplot to draw (data=alldat), and the specific aesthetics we want drawn (aes(x=Generation, y=Score)). On top of this, we layer our desired geom(s) - which is essentially the kind of plot we want: we started with points and then added lines. If you don’t add any geoms, ggplot will give you a blank plot - you’ve told it what to plot, but not how to plot it:

ggplot(data=alldat,aes(x=Generation, y=Score))

Layering geoms works how you would expect: the lines will be drawn on top of the points because we called geom_line() after calling geom_point(). You can do this in lots of different ways, including leaving the dataset and aesthetics out of the main ggplot call and instead specifying them inside the geom. The code below is identical to our initial plot:

pointscore<-ggplot()+
  geom_point(data=alldat,aes(x=Generation, y=Score))

#pointscore

We’ve gone to the trouble of tidying our data into a single dataframe, so going forward we’ll specify the data in the ggplot call (and this is generally the preferred method). However, I point this out because it’s worth noting that you can visualise multiple datasets on the same plot by using the layering afforded by passing the data/aesthetics to the geoms.
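As a sketch of that multiple-dataset layering (two toy data frames, not from this exercise), each geom can carry its own data= and aes():

```r
library(ggplot2)

# Two unrelated toy datasets (made-up values)
observed  <- data.frame(x = 1:3, y = c(1, 3, 2))
reference <- data.frame(x = 1:3, y = c(2, 2, 2))

# Nothing is set in ggplot() itself; each geom supplies its own data
p <- ggplot() +
  geom_point(data = observed,  aes(x = x, y = y)) +
  geom_line(data  = reference, aes(x = x, y = y))
```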

In general, don’t forget your geoms, and make sure the way you’re drawing your data lines up with the kind of data you have: ggplot will not prevent us from plotting a binary outcome variable as a scatterplot using geom_point() (because it interprets it as a double) even though it’s useless. We’ll talk more about how to make good visualisation decisions later on.

However, for now, we need some kind of summary of these binary outcome variables in order to visualise what’s happening. Are scores changing over generations? Are children and baboons performing differently? To see this, we need to calculate the proportion of correctly innovated trials per generation; since the data were coded as 0 for a missed trial and 1 for a successful one (a generally useful convention), we can use a simple mean to do this, creating a new variable called meanScore using the dplyr summarise() function.
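Before running the real summary, it may help to convince yourself that the mean of a 0/1-coded variable is a proportion (toy values, not the real data):

```r
# Five successes out of eight trials, coded as 1s and 0s
outcomes <- c(1, 0, 1, 1, 0, 1, 0, 1)
mean(outcomes)  # 5/8 = 0.625, the proportion of successful trials
```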

performance<-alldat %>%
  summarise(meanScore=mean(Score))
head(performance)
## # A tibble: 1 x 1
##   meanScore
##       <dbl>
## 1     0.861

This is great and all - we can see that the overall performance across children and baboons was about 86% - but we’ve lost a bunch of information. We need to add the use of group_by with our independent variables.

  • Pass the relevant column headers for the independent variables to the group_by function below
  • Add the standard error using the MeanSE() function (you can add as many new variables to summarise() as you like).
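If MeanSE() feels like a black box: for untrimmed data the standard error of the mean is just the sample standard deviation divided by the square root of the sample size. A quick base-R check on toy values (DescTools::MeanSE() should agree):

```r
# Standard error of the mean, computed by hand
x <- c(1, 0, 1, 1, 0)
sd(x) / sqrt(length(x))  # ~0.245
```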

Add the relevant bits to the code chunk below, uncomment, and run the code. Look at the head of the new performance data frame and note the difference. (also note that this is overwriting the previous value of the variable performance):

performance<-alldat %>%
  group_by(Generation, LearnerType) %>% #add the IVs of interest here
  summarise(meanScore=mean(Score),se=MeanSE(Score)) #use MeanSE() to add the standard error
## `summarise()` has grouped output by 'Generation'. You can override using the `.groups` argument.
head(performance)
## # A tibble: 6 x 4
## # Groups:   Generation [3]
##   Generation LearnerType meanScore     se
##        <dbl> <chr>           <dbl>  <dbl>
## 1          1 Baboon          0.711 0.0214
## 2          1 Child           0.861 0.0258
## 3          2 Baboon          0.818 0.0182
## 4          2 Child           0.944 0.0171
## 5          3 Baboon          0.771 0.0198
## 6          3 Child           0.917 0.0207


Now that this has the relevant information we need, we can plot the means over time:

scores<-ggplot(data=performance, aes(x=Generation, y=meanScore))+
  geom_point()

scores

Looks much better, and we can see kind of an upward trend (?), but we’re still missing some information - we need to add an aesthetic for LearnerType; we can’t tell which points are children and which are baboons. Map LearnerType to the colour aesthetic.

#add colour in the list of aesthetics (aes), and map it to LearnerType
scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType))+
  geom_point()

scores

Now we’re getting somewhere. See if you can add some stuff on your own:

  • Add a line
  • Add errorbars using the value of the standard error we built into the performance summary - the docs will be helpful here, but also try the best way to troubleshoot: Google “errorbars in ggplot” and find your own resource
  • This is nice, but it doesn’t show where random performance would be. In other words, randomly clicking cells would give about a 65% chance (or 0.65) of scoring 1 on any given trial. How might you add a representation of chance performance to the plot? (hint here, but also use the google…)
#Add a line, errorbars, and a representation of chance performance to the original plot

scores<-scores+
  geom_line()+
  geom_errorbar(aes(ymin=meanScore-se,ymax=meanScore+se))+
  geom_hline(aes(yintercept=0.65))

scores

  • Finally, apply the same steps to the BinTetroDone variable (summarising into a new data frame, followed by plotting) to graph the proportion of tetrominoes produced over time in each group.
tetrominoProp<-alldat %>%
  group_by(Generation, LearnerType) %>% 
  summarise(tetProp=mean(BinTetroDone),se=MeanSE(BinTetroDone))
## `summarise()` has grouped output by 'Generation'. You can override using the `.groups` argument.
tetrominoes<-ggplot(data=tetrominoProp,aes(x=Generation,y=tetProp,colour=LearnerType))+
geom_point()+
  geom_line()+
  geom_errorbar(aes(ymin=tetProp-se,ymax=tetProp+se))+
  geom_hline(aes(yintercept=0.005))

tetrominoes

1.3 Getting more complex: Entropy

So far we’ve been dealing with variables that are pretty straightforward: for each trial we had a binary outcome (did they succeed at the trial or not, did they create a tetromino or not) that we used to create a proportion (by averaging the binary outcomes across both trials and participants for each generation). However, the “predictability” of the cells selected is a more complex variable. For this, we have to calculate the entropy of the set of values of the GridDone variable for each participant.
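As a warm-up, here’s Shannon entropy computed by hand on a toy response set; this mirrors what Entropy(table(...), base=exp(1)) from DescTools does:

```r
# A participant who responded "A" twice, "B" once, "C" once
resp <- c("A", "A", "B", "C")
p <- table(resp) / length(resp)  # relative frequencies: 0.5, 0.25, 0.25
-sum(p * log(p))                 # ~1.04 nats; higher = less predictable
```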

Do your best to translate the steps below into code - feel free to ask questions if you get stuck!

  • Create copies of the original dataframes which preserve Generation, ChainNb, and GridDone, using only the testing phase for baboons.
  • Bind these into a new dataframe (don’t forget to add columns for LearnerType first)
  • Use group_by and summarise to get an Entropy value for each generation/chain/learner type
    • Use the Entropy function from DescTools; note that this takes a frequency table as an argument. Wrap the GridDone variable in R’s table() function, which automatically calculates a frequency table. Use base=exp(1) as an argument to the entropy function to get the exact results from the original paper.
  • Use the group_by and summarise functions on the new summarised dataframe (creating a third and final dataframe) to calculate the mean and SE of entropy values across chains in each generation.
  • Plot these using the same concepts you used to plot the scores and the proportion of tetrominoes.
childEnt<-children %>%
  select(Generation, ChainNb, GridDone) %>%
  add_column(LearnerType="Child")

baboonEnt<-baboons %>%
  filter(TestingPhase=="Test") %>%
  select(Generation, ChainNb,GridDone)  %>%
  add_column(LearnerType="Baboon")


allEnt<-rbind(childEnt,baboonEnt)

sumEnt<-allEnt %>%
  group_by(Generation, ChainNb, LearnerType) %>%
  summarise(Entropy=Entropy(table(GridDone),base=exp(1)))
## `summarise()` has grouped output by 'Generation', 'ChainNb'. You can override using the `.groups` argument.
meanEnt<-sumEnt %>%
  group_by(Generation, LearnerType) %>%
  summarise(meanEntropy=mean(Entropy),se=MeanSE(Entropy))
## `summarise()` has grouped output by 'Generation'. You can override using the `.groups` argument.
predictability<-ggplot(data=meanEnt, aes(x=Generation, y=meanEntropy, colour=LearnerType))+
  geom_point()+
  geom_line()+
  geom_errorbar(aes(ymin=meanEntropy-se,ymax=meanEntropy+se))

predictability

1.4 Expectations vs Results

Finally, let’s compare the plots we made to our original expectations, and also to the actual plots from the published paper.

  • Start by uploading screenshots of your sketches into the project
  • Use ggsave() to save images of your plots to display next to the other plots (the fact that your final plots are saved to variable names should come in handy)
  • Plots from the actual publication are already in data/publishedPlots/ within the project
  • Fill in the correct paths/filenames and execute the code chunks below to compare

Comparing Scores Visually

Comparing Tetromino Proportions Visually

Comparing Entropy Visually

  • Do the data match your expectations, or are they closer to the null hypothesis case (assuming these differed)?
    • The data more or less confirm that scores go up over time and are better than random, that tetrominoes are much more common than random from the outset (although they don’t increase much). Statistical models in the published paper confirm this.
  • Why might the sketch of predictability differ from the diversity/entropy measures shown in plots calculated from the actual data (in both the exercise and publication)?
    • I introduced the concept of “predictability” to think about the results and implemented this in my sketches (you probably did too). I did this because it can often be easier to wrap your head around. Predictability is something like the inverse of Shannon Entropy - the higher the entropy, the less predictable a participant’s grid set is (if you had to guess which four cells they had chosen on a given trial, higher entropy means it’s a lot less likely you’d be able to make a correct prediction, because all of their responses are likely to be different). So entropy (or “diversity”, as they call it in the paper) decreasing over “time” is, in fact, equivalent to an increase in predictability over time. In other words, while my sketch looks different from the actual graph, the data do in fact align with my assumption that predictability would increase over time (more or less).
    • We could normalise the Shannon Entropy and then plot the inverse if we preferred to tell a visual (and corresponding prose-based) narrative about an increase in predictability (rather than a decrease in entropy) over “time”. You might consider this easier for a reader/viewer to understand, so it could be worth putting the work in. The upper and lower bounds of entropy will depend on your dataset, so normalisation is necessary prior to taking the inverse.
    • If you’ve got time to dig into this:
      • start with the total number of possible discrete 4-cell grids - since each grid is an unordered set of cells, this is \[\binom{n}{r}=\frac{n!}{r!\,(n-r)!}\], where \[n\] is the total number of cells (16) and \[r\] is the number of cells that can be selected on a given trial (4).
      • You’ll have to look to the paper to find how many trials each participant completed to find the upper/lower bounds of entropy for a response set.
      • Do you think this makes for a more compelling visual narrative?
  • While our plots show the same thing as the published ones, they don’t look as nice; what’s different?
    • There are lots of differences, but we’re missing informative axis labels, our colours are the defaults, and ggplot’s assumptions about our variable types have led to some weird axis ticks (e.g., it assumes Generation is a double/float rather than an integer, and so has put the axis breaks in odd places).
    • The scale, particularly on the y-axis, also differs markedly in the first two plots (especially the second one). This is because the addition of horizontal lines which show random performance has led to these plots showing more of the y-axis than the published ones. We’ll talk about this issue in the second half, but think about how this changes the way the data looks, and whether it makes for a more (or less) compelling visual in each case.
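One concrete piece of the entropy-bounds exercise above: assuming each 4-cell grid is an unordered set of cells (my reading of the task), R’s choose() gives the count of possible grids directly:

```r
# Unordered selections of 4 cells from a 16-cell grid
choose(16, 4)  # 1820
```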

In the second half of the workshop, we’ll work on tidying these up and making camera-ready visualisations.

2 Camera Ready Visualisation

In the last session we made decent progress with using R to make basic visualisations and compare those to our sketched expectations. However, we ended on a comparison that included some major differences between our graphs and the publication-ready ones: ours generally didn’t look as nice, and the ggplot() defaults didn’t necessarily represent our data in the best way. In this session, we’ll talk briefly about good practice in data visualisation, before moving on to how to tidy things up to make our plots look better.

Finally, I’ll direct you to some resources on network data visualisation, which is considerably more complex, and time permitting, we’ll start to fiddle with some network data using the igraph and ggnetwork packages.

2.1 Principles of good visualisation

  • Bad practice
    • Don’t overcomplicate things. Highlight relevant information, but don’t add redundant information unless it’s genuinely helpful.
    • Be mindful about axis ranges - especially the defaults of plotting software. Software will generally truncate the axes according to the range of your data, but this might mislead the viewer about the range of the scale (or behaviour) and the strength of the effect(s).
    • Be careful about sorting purely nominal categories, and never connect them with lines unless the line can mean something (which it only can if the categories are ordinal in some sense)
    • Pie graphs aren’t great. We are bad at interpreting radial proportions, so they tend not to be very effective. Stacked bar charts are a better choice.
    • If you must use a pie chart, make sure there are only 3-4 categories, and pay particular attention to the use of colour.
  • Good practice
    • You don’t want to massage your data to fit a narrative, but visually highlighting aspects of the data that support your narrative is a good idea. Annotating graphs is good. Make life easy for your reader/viewer.
    • Use visual elements strategically. Use colour, line, text, etc., to highlight relevant things for the viewer.
    • Think carefully about what viewers expect from colour, size, form, etc - e.g., using blue for a hot temperature doesn’t make much sense. This work on visual metaphor in graphs is very useful.
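To make the axis-range point concrete, here’s a minimal sketch with made-up numbers (the toy dataframe and values are purely illustrative):

```r
library(ggplot2)

#toy data (hypothetical): two conditions with a modest difference
toy<-data.frame(condition=c("A","B"), accuracy=c(0.82,0.86))

p<-ggplot(toy, aes(x=condition, y=accuracy))+
  geom_col()

#full scale from 0: the difference looks modest
p

#zoomed y-axis: the identical difference now looks dramatic
p+coord_cartesian(ylim=c(0.8,0.9))
```

The data are identical in both plots; only the visible range has changed, which is exactly why default axis ranges deserve scrutiny.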

2.2 Tidy graphs

We ended the last session by thinking briefly about the differences between our graphs and the graphs that were published of the baboon and child cultural transmission data. The major differences are:

  • Axes are labelled differently
  • The scale of the axes is different
  • Axes don’t break where we would expect
  • The colours are different
  • The entire visual vibe of the graph is different (grey background, different font, etc)

First, we’ll deal with the axes, then the colour, and then the general visual feel of the graph, which is also known as the theme.

2.2.1 Tidy Axes

2.2.1.1 Labelling & Limits

Labelling your axes is simple in ggplot using xlab() and ylab().

scores<-scores+ylab("Proportion Successful Trials")
scores

Using this as a template, add labels to the axes for the performance and tetromino graphs.

tetrominoes<-tetrominoes+ylab("Proportion Tetromino Responses")

predictability<-predictability+ylab("Mean Entropy of Response Sets")

We might also want to more accurately label our LearnerType legend. While we generally don’t want spaces in our variable names within dataframes3, we probably do want this for readability on graph labels. We’ll learn how to change this when we deal with scales more generally later on.

First, let’s add a label to the line that denotes random performance (and make it dashed so it’s a bit more obvious it’s not just the bottom of the graph). Look at the documentation for annotate(). Add a label near the random performance line; I’ve already made the line dashed, but note that this has required re-doing the entire ggplot call rather than just adding to the existing plot. Why do you think this is necessary in this case?4

scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType))+
  geom_hline(yintercept=0.65, linetype="dashed")+
  geom_line()+
  geom_errorbar(aes(ymin=meanScore-se,ymax=meanScore+se))+
  ylab("Proportion Successful Trials")+
  annotate("text",x=2,y=0.65,label="Chance\nPerformance")

scores

Finally, we can change the limits of the axes - per the dos and don’ts of good data visualisation, we might want to show the entire range of the variable on the y-axis. We can do this easily with ylim():

scores+ylim(0,1)

This certainly changes how our results look. The rise in performance now doesn’t look as stark as it did before. However, you’ll notice I used scores+ylim(0,1) rather than redefining the variable scores as scores<-scores+ylim(0,1). This is because while this is illustrative of the ylim() argument (and there’s a corresponding xlim()), showing the whole range doesn’t make much sense in this case. The relevant benchmark for performance in this task is the random chance line we’ve added, not 0; proportions closer to 0 would indicate a bias for actually copying grid cells despite the constraints of the task disfavouring this. In other words, it’s not especially surprising that values down near zero (or even below 0.5) aren’t in our data, so not showing them is fine.

The fact that truncating the y-axis in this case dovetails with making the results look better is a bonus. We’ve even added something the published graph is missing, which is a representation of random performance (the actual “floor” we’d expect in this data). Remember to be careful not to truncate the axes just because it makes the results look nice - make sure they are also actually nice relative to your expectations.

Once you get around to prettifying the tetromino graph, think about this issue again. The published graph here also has a very truncated y-axis - does it also make sense for that variable? Why or why not?5

2.2.1.2 Breaks & Scales

So far we’ve made some changes using ylab() and ylim(). These are handy for making quick changes only to these attributes - we might leave it at that if we didn’t have any other issues. However, our x-axis is doing something weird with where it’s deciding to break, at 2.5, 5.0, 7.5, and 10.0, interpreting Generation as a double. However, the value 2.5 isn’t possible for this variable, and where the labels on the axis sit is a bit strange.

Enter scale...() layers. We can add these to our plot to specify things like the limits, labels, and breaks of a scale all at the same time. I’ll model this for the y-axis of the graph, making some changes we don’t necessarily need, but that might look good:

scores<-scores+
  scale_y_continuous(limits=c(0.6,1), breaks=c(0.6,0.8,1.0),name="Proportion Successful Trials")

scores

You can see that I’ve specified the limits and the breaks I want by passing a vector using c() - note that even if you want e.g., only a single break, you need to pass a vector. Now, apply these concepts to scale_x_continuous(), giving it a range between 0 and 10 with breaks at each generation. Rename it as “Generation (Time)” for funsies:

scores<-scores+
  scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")

scores

You’ll notice this has done something a little weird with the errorbars - they are cut off for the first and last generation. This is because ggplot is trying to use coordinates <1 and >10 to draw these lines out horizontally. We’ll deal with this later by changing geom_errorbar() to geom_pointrange(), which removes the horizontal bars.
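This clipping reflects a general ggplot behaviour worth knowing: limits set via a scale discard data outside the range before anything is drawn, whereas coord_cartesian() only zooms the viewport. A toy sketch (the data here are made up):

```r
library(ggplot2)

toy<-data.frame(x=1:5, y=c(0.2,0.5,0.6,0.7,0.9))
p<-ggplot(toy, aes(x=x, y=y))+geom_line()

#scale limits: the y=0.2 point is removed entirely (with a warning),
#so the line is broken
p+ylim(0.4,1)

#coord_cartesian(): all data are kept; the view is simply zoomed
p+coord_cartesian(ylim=c(0.4,1))
```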

Now we have one more scale to deal with - we want a space in the LearnerType legend title. This is as simple as using the name argument in scale_colour_discrete(). Also change ‘Child’ to ‘Human Child’ using labels=c() - note that for discrete scales, ggplot will expect as many labels as there are levels of the variable. Also be mindful of the existing order of the levels - ggplot will happily let you label them incorrectly. Try it:

scores<-scores+
  scale_colour_discrete(name="Learner Type", labels=c("Baboon","Human Child"))

#below would throw an error because you've only passed one label
#scale_colour_discrete(name="Learner Type", labels=c("Human Child"))

#below would run perfectly, but now the baboon data would be labelled "Human Child" because you've reversed the order
#scale_colour_discrete(name="Learner Type", labels=c("Human Child","Baboon"))
scores

Note that we’ve used a slightly different description of the scale here, ending in _discrete rather than _continuous - this is because we’ve mapped colour to a discrete variable rather than a continuous one. Likewise, if you had a completely categorical x axis (something like e.g., nationality), you’d use scale_x_discrete() to specify the properties of the scale.
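For instance, with a hypothetical categorical variable, scale_x_discrete() plays the same role as its continuous counterpart (made-up data, just to illustrate):

```r
library(ggplot2)

#hypothetical counts by nationality
df<-data.frame(nationality=c("French","German","Spanish"), n=c(12,9,15))

ggplot(df, aes(x=nationality, y=n))+
  geom_col()+
  #labels are matched to the levels in order: French, German, Spanish
  scale_x_discrete(name="Nationality", labels=c("FR","DE","ES"))
```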

2.2.1.3 Other tidying

There are a few more tidbits we want to tidy up before we move on. First, let’s look at the plot call as a whole - we can lose sight of this using the scores+ method, and from now on we’ll have to edit the entire chunk of code because we’re changing values in particular existing functions, as opposed to adding entire functions. Normally, you’d do this in a single chunk and run it repeatedly as you make changes. However, we’ll do this sequentially as part of the exercise (though this means some ugly code repetition you wouldn’t normally have in your own pipeline).

scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType))+
  geom_hline(yintercept=0.65, linetype="dashed")+
  geom_line()+
  geom_errorbar(aes(ymin=meanScore-se,ymax=meanScore+se))+
  annotate("text",label="Chance\nPerformance",x=2,y=0.65)+
  scale_y_continuous(limits=c(0.6,1), breaks=c(0.6,0.8,1.0),name="Proportion Successful Trials")+
  scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
  scale_colour_discrete(name="Learner Type", labels=c("Baboon","Human Child"))

scores

First, let’s deal with the errorbars that are getting cut off - change geom_errorbar to geom_pointrange and look at the difference.

scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType))+
  geom_hline(yintercept=0.65, linetype="dashed")+
  geom_line()+
  #change geom_errorbar to geom_pointrange
  geom_pointrange(aes(ymin=meanScore-se,ymax=meanScore+se))+
  annotate("text",label="Chance\nPerformance",x=2,y=0.65)+
  scale_y_continuous(limits=c(0.6,1), breaks=c(0.6,0.8,1.0),name="Proportion Successful Trials")+
  scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
  scale_colour_discrete(name="Learner Type", labels=c("Baboon","Human Child"))

scores

That already looks much cleaner, and it’s added points which are useful for seeing where the actual means are more clearly. There’s one final change that we can make to align the aesthetics of the data with the published graphs: they’ve used the shape of the points to contrast between the children and baboons. We can do the same - add a shape aesthetic to geom_pointrange(), and map it to LearnerType.

scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType, shape=LearnerType))+
  geom_hline(yintercept=0.65, linetype="dashed")+
  geom_line()+
  #change geom_errorbar to geom_pointrange
  geom_pointrange(aes(ymin=meanScore-se,ymax=meanScore+se, shape=LearnerType))+
  annotate("text",label="Chance\nPerformance",x=2,y=0.65)+
  scale_y_continuous(limits=c(0.6,1), breaks=c(0.6,0.8,1.0),name="Proportion Successful Trials")+
  scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
  scale_colour_discrete(name="Learner Type", labels=c("Baboon","Human Child"))

scores

This has done something slightly undesirable: it’s given us a separate legend for colour and shape - this is because towards the end of the code block we’ve redefined the properties of the colour scale. Since these don’t match between shape and colour, ggplot has created two different legends. Make scale_shape_discrete() identical to scale_colour_discrete(), and ggplot will automatically merge them:

scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType, shape=LearnerType))+
  geom_hline(yintercept=0.65, linetype="dashed")+
  geom_line()+
  #change geom_errorbar to geom_pointrange
  geom_pointrange(aes(ymin=meanScore-se,ymax=meanScore+se, shape=LearnerType))+
  annotate("text",label="Chance\nPerformance",x=2,y=0.65)+
  scale_y_continuous(limits=c(0.6,1), breaks=c(0.6,0.8,1.0),name="Proportion Successful Trials")+
  scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
  scale_colour_discrete(name="Learner Type", labels=c("Baboon","Human Child"))+
  scale_shape_discrete(name="Learner Type", labels=c("Baboon","Human Child"))

scores

Finally, the legend hanging way out on the right there is making this plot a lot wider than it needs to be, and this is kind of distracting. Let’s move it into the plot itself, on the lower right where there’s not much going on. Google change legend position ggplot to figure out how to do this - this will give you a preview of the theme() function, which we’ll look at shortly. Note that while annotate() used the x,y values on the plot determined by your data, legend positioning within theme() uses x,y coordinates between 0 and 1, where the bottom left of the plot is 0,0 and the top right is 1,1. We’ll be creating a custom theme that you can keep and apply to all your plots.

scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType, shape=LearnerType))+
  geom_hline(yintercept=0.65, linetype="dashed")+
  geom_line()+
  #change geom_errorbar to geom_pointrange
  geom_pointrange(aes(ymin=meanScore-se,ymax=meanScore+se, shape=LearnerType))+
  annotate("text",label="Chance\nPerformance",x=1.5,y=0.65)+
  scale_y_continuous(limits=c(0.6,1), breaks=c(0.6,0.8,1.0),name="Proportion Successful Trials")+
  scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
  scale_colour_discrete(name="Learner Type", labels=c("Baboon","Human Child"))+
  scale_shape_discrete(name="Learner Type", labels=c("Baboon","Human Child"))+
  theme(legend.position=c(0.8,0.3))

scores

Once you’ve had a go, and before moving on to colour and shape, think quickly about why we wouldn’t want to put legend.position into a custom theme that we’d apply generally to all our plots, even if we want all our legends to be inset within our plots.6

2.2.2 Custom Colours & Shapes

Now let’s think about our colours and shapes. The colours and shapes ggplot has used to denote baboons vs children are the defaults. You probably don’t want to use these for a few reasons.

  • They aren’t that good. The shapes are kind of fine, but the colours are bad - the default ggplot colours in particular are bad for colour blindness, which occurs in about 8% of the population. You want everyone to be able to read your graph (this is also why using shape redundantly is a good idea in this case - it adds value, not just visual noise)
  • Even if the defaults are okay, you should be actively deciding to keep them: overall, your visualisation decisions should be made by you, not software. Customisation is a major feature of ggplot - use it!
  • This plot screams “ggplot defaults” even after we’ve gone to the trouble of fiddling with it a whole lot. You want your plots to look customised.

Luckily, all of this can be done by messing with our scales and supplying specific values. However, because we’re manually overriding some basic defaults, we now need to change from scale_whatever_discrete() to scale_whatever_manual(). Alter the code below to change the discrete scales to manual scales, and add values.

  • For colours, you can:
    • use basic colour terms (e.g., c(“red”, “blue”))
    • create a custom palette using a generator or colour picker (my favourite is coolors.co - enter the hex values in quotes as a string, e.g., c(“#9E1946”,“#4D6CFA”))
    • use RColorBrewer
  • For shapes, look at the shape values built into ggplot

scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType, shape=LearnerType))+
  geom_hline(yintercept=0.65, linetype="dashed")+
  geom_line()+
  #change geom_errorbar to geom_pointrange
  geom_pointrange(aes(ymin=meanScore-se,ymax=meanScore+se, shape=LearnerType))+
  annotate("text",label="Chance\nPerformance",x=2,y=0.65)+
  scale_y_continuous(limits=c(0.6,1), breaks=c(0.6,0.8,1.0),name="Proportion Successful Trials")+
  scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
  scale_colour_manual(name="Learner Type", labels=c("Baboon","Human Child"), values=c("#5B5F97","#A5BE00"))+
  scale_shape_manual(name="Learner Type", labels=c("Baboon","Human Child"), values=c(0,2))+
  theme(legend.position=c(0.8,0.3))

scores

We’ve done a fair bit to this plot, but it still looks generic, and very unlike the published plots. The remaining aesthetics - font, background colour, etc. - are related to the theme() rather than the scales. In the next section, we’ll make a custom theme for our plots.

2.2.3 Custom Themes

Start out by looking at the default themes available in ggplot2 - you can simply add these to any plot to make fairly immediate changes. My favourite is theme_bw(), illustrated below, but I encourage you to try others.

scores+theme_bw()

Note, however, that this has messed with our legend, because theme_bw() constitutes a second call to theme(), overriding the first. We need to set a custom theme, so that the legend position set in the later theme() call applies in addition to it. Below is my preferred theme - notice that it elaborates upon theme_minimal(). Also note that theme_set() overrides existing theme pre-sets, and then the existing call to theme() within scores adds the legend in the correct position.

theme_set(theme_minimal()+theme(text = element_text(family = "Times",size=15),plot.title = element_text(hjust = 0.5)))

scores

Fiddle with this to create your own theme, and then look at the predictability and tetrominoes plots again. Note that you can set your plotting theme at the start of your R session, or in the first code chunk in RMarkdown where you generally load packages - it will then apply to all the plots you generate during the session. Now, move on to applying this elsewhere, and debugging one final issue:

  • Alter the predictability and tetrominoes plots in the same way we did the scores plot so it has a matching aesthetic, particularly in terms of the scales.
tetrominoes<-ggplot(data=tetrominoProp, aes(x=Generation, y=tetProp,colour=LearnerType, shape=LearnerType))+
  geom_hline(yintercept=0.0005, linetype="dashed")+
  geom_line()+
  geom_pointrange(aes(ymin=tetProp-se,ymax=tetProp+se, shape=LearnerType))+
  annotate("text",label="Chance\nProportion",x=2,y=0.07)+
  scale_y_continuous(limits=c(0,1), breaks=c(0,0.5,1.0),name="Proportion of Tetrominoes Produced")+
  scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
  scale_colour_manual(name="Learner Type", labels=c("Baboon","Human Child"), values=c("#5B5F97","#A5BE00"))+
  scale_shape_manual(name="Learner Type", labels=c("Baboon","Human Child"), values=c(0,2))+
  theme(legend.position=c(0.8,0.3))

tetrominoes

predictability<-ggplot(data=meanEnt, aes(x=Generation, y=meanEntropy, colour=LearnerType))+
  geom_line()+
  geom_pointrange(aes(ymin=meanEntropy-se,ymax=meanEntropy+se,shape=LearnerType))+
  ylab("Mean Entropy of Response Set")+
  scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
  scale_colour_manual(name="Learner Type", labels=c("Baboon","Human Child"), values=c("#5B5F97","#A5BE00"))+
  scale_shape_manual(name="Learner Type", labels=c("Baboon","Human Child"), values=c(0,2))+
  theme(legend.position=c(0.5,0.8))

predictability

  • If you’ve kept the font family in your theme as “Times” as I have in mine, you’ll notice that the annotations in the score and tetromino graphs are still in a sans-serif font, despite us having altered the theme. How do you think we would fix this? (hint here)
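One possible fix (spoiler, so have a think first): layers like annotate() don’t inherit the theme font, so the font family has to be set in the call itself. In the scores plot, the annotation layer would become something like:

```r
#annotations don't pick up theme fonts; set family directly in the call
annotate("text", label="Chance\nPerformance", x=2, y=0.65, family="Times")
```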

2.3 Advanced: Network Visualisation

Network visualisation is an advanced topic, and one we won’t have much time to dig into. There are three reasons for this:

  • Assuming at least some attendees are just starting out with ggplot, we probably don’t have time. However, if you’re already fairly advanced with this, you might blow through these exercises and get here on your own.
  • Network visualisation is advanced. Eyeballing data like the proportions we’ve just looked at, or many other categorical or continuous variables, is inherently useful. This is not necessarily the case for networks: the larger a network is, the more useful it is to analyse, but the less useful it is to visualise. Large networks quickly get visually garbled - you’re better off comparing e.g., continuous attributes of nodes with their degree using a scatterplot.
  • I haven’t personally used R for network visualisation very much; I’ve used d3js, which is an entirely different animal (it’s based on JavaScript rather than R, and uses JSON-style data in lieu of spreadsheet-style dataframes, but it’s very useful to learn if you’re keen on visualisation).

Regardless, maybe we can make some headway on network visualization in R together. One of the things I hope you take away from this workshop is that once you know some basics, you can google almost anything - there are tons of online resources for R. For example, this workshop by Katy Ognyanova looks to be a very detailed treatment of network visualisation from an expert. While teaching yourself on the internet might not make for the fastest progress, you will progress, and you’ll learn concepts more deeply by applying them to problems that are inherently meaningful to you rather than following arbitrary tutorials.

Below, I’ve started to fiddle with the igraph package (docs here) and the ggnetwork package, which allows you to draw networks using ggplot-style syntax. I’ve used data from Wild et al., 2019, which deals with the diffusion of sponge foraging in dolphins (it has some authors that might be familiar to you from elsewhere in the workshop, including Sonja Wild and Will Hoppitt). The data is inside the networkData folder - take a closer look at the paper to learn more about it. Note, however, related to my point above about the utility of network visualisation: the paper doesn’t actually include any network graphs - probably because with networks this large, they aren’t terribly useful. It’s nonetheless useful data for starting to learn how to visualise networks in R. However, if you have some of your own network data to play with, that will be even more useful.

Below, I’ve made a crack at looking at relatedness among sponge foragers, which, in and of itself, didn’t account for much diffusion of foraging strategies in Wild et al.’s findings. I’ve run up against some walls already, which I’ve noted - mainly, I need to think of ways to make the network smaller before it can be useful to look at. See if you can make progress, or use the other data files (particularly social vertical and horizontal relatedness) to check out other ways of looking at the network.

library(igraph)
## 
## Attaching package: 'igraph'
## The following object is masked from 'package:DescTools':
## 
##     %c%
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(ggnetwork)
#load relatedness matrix
relatedness<-read_csv("networkData/relatedness.csv")
## Warning: Missing column names filled in: 'X1' [1]
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   X1 = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
#load individual attributes
indVars<-read_csv("networkData/ILVs.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   id_individual = col_character(),
##   Sex_1_0 = col_double(),
##   Haplotype = col_character(),
##   Number_sightings = col_double(),
##   Av_water_depth = col_double(),
##   Av_group_size = col_double(),
##   Sponger = col_character(),
##   Demons_sponging_forage = col_character(),
##   Sp_Order_acquisition = col_double(),
##   Mum_known = col_character(),
##   Not_weaned = col_double()
## )
#keep only spongers sighted more than 7 times - other dolphins were excluded from analysis
indVars<-indVars %>%
  filter(Number_sightings>7, Sponger=="yes")

#use a copy of the IDs as a vector so we can apply this to the matrix
validDolphins<-as.vector(indVars$id_individual)

rel<-relatedness %>%
  #filter out rows that have dolphins with few sightings
  filter(X1 %in% validDolphins) %>%
  #select columns that are valid dolphins
  select(one_of(validDolphins))

g<-graph_from_adjacency_matrix(as.matrix(rel),weighted="relcoef") %>%
  set_vertex_attr("AvgGroupSize",value=indVars$Av_group_size)

#delete edges with very low relatedness?
#g<-delete_edges(g, which(E(g)$relcoef<0.1))
#above works, but not so well without also deleting nodes that have no edges after this point
# basic format for visualising the network
nettest<-ggplot(g, aes(x=x,y=y,xend=xend,yend=yend))+
  geom_edges(aes(alpha=relcoef),curvature=0.1)+
  geom_nodes(aes(size=AvgGroupSize), colour="blue")+
  theme_blank()
nettest
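Building on the commented-out pruning above, one way to shrink the network might be to delete the weak edges and then remove any vertices left isolated - a sketch I haven’t fully road-tested on this data:

```r
#prune edges with low relatedness, then drop vertices with no remaining edges
g2<-delete_edges(g, which(E(g)$relcoef<0.1))
g2<-delete_vertices(g2, which(degree(g2)==0))

#re-draw the smaller network with the same ggnetwork recipe as above
ggplot(g2, aes(x=x,y=y,xend=xend,yend=yend))+
  geom_edges(aes(alpha=relcoef),curvature=0.1)+
  geom_nodes(aes(size=AvgGroupSize), colour="blue")+
  theme_blank()
```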


  1. read_csv spits out some information about column specifications. Can you tell what this is all about? Why would you want this?↩︎

  2. group_by() is an incredibly useful function, especially with summaries, but there is a conflict between the tidyverse package underlying this, dplyr, and another common package used for data manipulation, plyr. If you’ve loaded plyr after dplyr, group_by() (and potentially some other dplyr functions) won’t work properly. Even if you don’t think you’re using plyr, it’s a dependency of many other R packages and might be loaded without you really knowing this is happening. The tidyverse is set to throw a warning when plyr is loaded, but you might miss it (and then be mystified as to why group_by() is failing to work), so keep it in mind. You may need to fiddle with the order in which you load packages, or detach packages before trying to use group_by().↩︎

  3. In fact, R won’t allow this. If your variable names have spaces in them in e.g., an input .csv or Excel file, R will replace the spaces with a dot, such that e.g., “Learner Type” would become “Learner.Type”.↩︎

  4. This is because otherwise it’s just drawing a dashed line over the solid line I put in the original plot, so we can’t see it. We need to remove that original line. To see this, first try doing scores+geom_hline(yintercept=0.65, linetype="dashed",colour="red") - you’ll be able to see it draw a dashed red line over the solid black one.↩︎

  5. I think truncating the y-axis in the tetromino case is actually a bit suspect, and misrepresents the results a bit. The chance of randomly producing a tetromino is much lower, something like 0.002, so truncating the y-axis around 0.6 is a bit odd. What this does is make the variation in tetromino proportions look fairly intense, and quite disparate between children and baboons. However, when you look at our graph where we’ve added the baseline (and thus extended the y-axis range automatically, by asking ggplot to add something at a y-intercept of 0.002), there’s less of a distance between baboons and children, and it’s clearer that they’re both performing way above baseline consistently, and in fact close to ceiling.↩︎

  6. The reason not to put this in a general theme is because a) it won’t always be possible to put the legend within the plot; sometimes the plot is just full of data! and b) Even if it is possible to put the legend in the plot, where it will go will change from plot to plot, depending largely on where there might be a little space. Therefore, we don’t really want this to be part of a theme we apply to all our plots (like font, font size, axis label tilting, etc) because there is no one-position-fits-all solution here.↩︎